194 research outputs found

    Memory Efficient Multithreaded Incremental Segmented Sieve Algorithm

    Prime numbers are fundamental in number theory and play a significant role in many areas, from pure mathematics to practical applications such as cryptography. In this contribution, we introduce a multithreaded implementation of the segmented sieve algorithm. Instead of handling large prime ranges in a single iteration, our implementation breaks the sieving process down incrementally, which in theory eliminates the challenges of working with large numbers and can reduce memory usage, providing more efficient multi-core utilization over extended computations. Comment: 10 pages
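    The incremental idea can be sketched in a single-threaded form: sieve small base primes once, then process fixed-size segments one at a time so memory stays bounded regardless of the limit. This is a generic segmented sieve, not the paper's multithreaded implementation.

```python
from math import isqrt

def segmented_sieve(limit, segment_size=1 << 16):
    """Yield primes up to `limit`, sieving one fixed-size segment at a time."""
    root = isqrt(limit)
    # Simple sieve for the base primes up to sqrt(limit).
    is_prime = [True] * (root + 1)
    base = []
    for p in range(2, root + 1):
        if is_prime[p]:
            base.append(p)
            for m in range(p * p, root + 1, p):
                is_prime[m] = False
    yield from base
    # Sieve the rest of the range in bounded-size segments.
    low = root + 1
    while low <= limit:
        high = min(low + segment_size - 1, limit)
        seg = [True] * (high - low + 1)
        for p in base:
            # First multiple of p inside [low, high], at least p*p.
            start = max(p * p, (low + p - 1) // p * p)
            for m in range(start, high + 1, p):
                seg[m - low] = False
        for i, ok in enumerate(seg):
            if ok:
                yield low + i
        low = high + 1
```

    Because each segment only needs `segment_size` booleans, the working set stays constant even as `limit` grows, which is what makes per-segment work easy to hand off to worker threads.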

    Path planning for socially-aware humanoid robots

    Designing efficient autonomous navigation systems for mobile robots involves considering the robot's environment while arriving at a system architecture that trades off multiple constraints. We have architected a navigation framework for socially-aware autonomous robot navigation that uses only on-board computing resources. Our goal is to foster the development of several important service robotics applications on this platform. Our framework allows a robot to autonomously navigate indoor environments while accounting for people (i.e., estimating the path of every individual in the environment) and respecting each individual's private space. Our design can leverage a wide range of sensors for navigation, including cameras, 2D and 3D scanners, and motion trackers. When designing our sensor system, we considered that mobile robots have limited resources (i.e., power and computation) and that some sensors are costlier than others (e.g., cameras and 3D scanners stream data at high rates, requiring intensive computation to provide useful insight for real-time navigation). We trade off accuracy, responsiveness, and power, choosing a Hokuyo UST-20LX 2D laser scanner for robot localization, obstacle detection, and people tracking, and an MPU-6050 for motion tracking. Our navigation framework features a low-power sensor system (< 5W) tailored for improved battery life in robotic applications while providing sufficient accuracy. We have completed a prototype for a Human Support Robot using the available on-board computing devices, requiring less than 60W to run. We estimate that we can obtain similar performance while reducing power by ~60% by utilizing low-power, high-performance accelerator hardware and parallelized software. Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech
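    One common way to encode "private space" in a planner is to inflate the traversal cost of grid cells near predicted person positions. The sketch below is our illustration of that idea, not the authors' implementation; the radius and penalty values are hypothetical.

```python
import math

def social_cost(cell, people, base_cost=1.0, radius=1.2, penalty=10.0):
    """Cost of entering `cell` (x, y), inflated near each estimated person position.

    A grid planner (e.g., A*) using this cost will route paths around people.
    """
    x, y = cell
    cost = base_cost
    for px, py in people:
        d = math.hypot(x - px, y - py)
        if d < radius:
            # Linear falloff: maximum penalty at the person, zero at the radius.
            cost += penalty * (1.0 - d / radius)
    return cost
```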

    Vega: A Computer Vision Processing Enhancement Framework with Graph-based Acceleration

    The popularity of Computer Vision (CV) algorithms has been on the rise given their growing dependence on machine learning and deep neural networks. The resulting improvement in inference accuracy has revolutionized a number of fields. However, because CV algorithms consist of many stages, each with different computing characteristics, their execution is frequently irregular and inefficient, unable to leverage the full potential of the computing platform. Presently, supporting real-time video processing of high-resolution images on edge systems involves a significant amount of programming effort and performance tuning. To overcome this challenge, we present Vega, a parallel graph-based framework that enables better utilization of multi-core edge computing platforms. Vega provides a highly flexible and user-friendly interface for executing a range of CV algorithms efficiently, leveraging multiple external libraries for performance. First, Vega maps the independent stages of a CV algorithm to nodes in a pipeline graph. Next, it dynamically schedules the nodes on a multi-core CPU using multi-threading. In our experiments, the framework improves the performance of all selected algorithms by at least 1.75x, and up to 4.82x, on the same platform. We also analyze the impact of the framework in terms of hardware utilization, frame processing latency, and throughput.
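    The pipeline-graph idea can be sketched with a thread pool: each stage is a node, and different frames overlap across worker threads while each frame still passes through the stages in order. The names below are ours, not Vega's API.

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipeline(frames, stages, workers=4):
    """Push each frame through the stage functions in order,
    overlapping the processing of different frames across threads."""
    def process(frame):
        for stage in stages:
            frame = stage(frame)
        return frame
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so output frames stay in sequence.
        return list(pool.map(process, frames))

# Toy stages standing in for CV kernels (e.g., decode, filter, detect).
stages = [lambda x: x * 2, lambda x: x + 1]
results = run_pipeline([1, 2, 3], stages)
```

    A real scheduler would additionally track inter-node dependencies in the graph and pick ready nodes dynamically, rather than running whole per-frame chains.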

    Defensive Dropout for Hardening Deep Neural Networks under Adversarial Attacks

    Deep neural networks (DNNs) are known to be vulnerable to adversarial attacks: adversarial examples, obtained by adding delicately crafted distortions to original, legal inputs, can mislead a DNN into classifying them as any target label. This work provides a solution for hardening DNNs against adversarial attacks through defensive dropout. Besides using dropout during training for the best test accuracy, we propose using dropout at test time as well, to achieve strong defense effects. We model the problem of building robust DNNs as an attacker-defender two-player game, in which the attacker and the defender know each other's strategies and try to optimize their own strategies towards an equilibrium. Based on observations of the effect of the test dropout rate on test accuracy and attack success rate, we propose a defensive dropout algorithm that determines an optimal test dropout rate given the neural network model and the attacker's strategy for generating adversarial examples. We also investigate the mechanism behind the outstanding defense effects achieved by the proposed defensive dropout. Compared with stochastic activation pruning (SAP), another defense method that introduces randomness into the DNN model, we find that defensive dropout achieves much larger variances of the gradients, which is the key to the improved defense effects (a much lower attack success rate). For example, defensive dropout can reduce the attack success rate from 100% to 13.89% under the currently strongest attack, the C&W attack, on the MNIST dataset. Comment: Accepted as conference paper on ICCAD 201

    Instruction replication for clustered microarchitectures

    This work presents a new compilation technique that uses instruction replication to reduce the number of communications executed on a clustered microarchitecture. On such architectures, the need to communicate values between clusters can result in significant performance loss. Inter-cluster communications can be reduced by selectively replicating an appropriate set of instructions. However, instruction replication must be done carefully, since it may also degrade performance due to the increased contention it can place on processor resources. The proposed scheme is built on top of a previously proposed state-of-the-art modulo scheduling algorithm that effectively reduces communications. Results show that replication can decrease the number of communications, which results in significant speed-ups: IPC is increased by 25% on average for a 4-cluster microarchitecture, and by as much as 70% for selected programs. Peer Reviewed. Postprint (published version).
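    A toy model (our illustration, not the paper's algorithm) makes the trade-off concrete: a communication is needed whenever an operand's value was produced on a different cluster than its consumer, and replicating the producer onto the consumer's cluster removes that communication at the cost of an extra issue slot.

```python
def count_communications(schedule):
    """Count inter-cluster value transfers in a toy schedule of
    (name, cluster, operands) instruction tuples."""
    produced_on = {}  # value name -> set of clusters that produce it
    comms = 0
    for name, cluster, operands in schedule:
        for op in operands:
            if op in produced_on and cluster not in produced_on[op]:
                comms += 1  # value must be sent from another cluster
        produced_on.setdefault(name, set()).add(cluster)
    return comms

# 'a' is produced on cluster 0 but also consumed on cluster 1: one communication.
original = [("a", 0, []), ("b", 0, ["a"]), ("c", 1, ["a"])]
# Replicating 'a' on cluster 1 removes the communication, using an extra slot.
replicated = [("a", 0, []), ("a", 1, []), ("b", 0, ["a"]), ("c", 1, ["a"])]
```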

    Hardware support for Local Memory Transactions on GPU Architectures

    Graphics Processing Units (GPUs) are popular hardware accelerators for data-parallel applications, enabling the execution of thousands of threads in a Single Instruction, Multiple Thread (SIMT) fashion. However, the SIMT execution model is not efficient when code includes critical sections that protect access to data shared by the running threads. In addition, GPUs offer two shared spaces to the threads: local memory and global memory. Typical solutions to thread synchronization include using atomics to implement locks, serializing the execution of the critical section, or delegating the execution of the critical section to the host CPU, all leading to suboptimal performance. In the multi-core CPU world, transactional memory (TM) was proposed as an alternative to locks for coordinating concurrent threads, and some TM solutions for GPUs have started to appear in the literature. In contrast to these earlier proposals, our approach is to design hardware support for TM at two levels. The first level is a fast and lightweight solution for coordinating threads that share local memory, while the second level coordinates threads through global memory. In this paper we present GPU-LocalTM, a hardware TM (HTM) support for the first level. GPU-LocalTM offers simple conflict detection and version management mechanisms that minimize the hardware resources required for its implementation. For the workloads studied, GPU-LocalTM provides between 1.25x and 80x speedup over serialized critical sections, while the overhead introduced by transaction management is lower than 20%. Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech
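    The two TM ingredients the abstract names, conflict detection and version management, can be illustrated with a small software toy (GPU-LocalTM itself is hardware support): each transaction tracks a read set and buffers its writes, and two transactions conflict when one's writes overlap the other's reads or writes.

```python
class Transaction:
    """Toy transaction over a shared dict standing in for local memory."""

    def __init__(self, memory):
        self.memory = memory
        self.read_set = set()
        self.write_buf = {}  # buffered writes: simple version management

    def read(self, addr):
        self.read_set.add(addr)
        # Read-your-own-writes, else fall back to committed memory.
        return self.write_buf.get(addr, self.memory.get(addr, 0))

    def write(self, addr, value):
        self.write_buf[addr] = value

    def conflicts_with(self, other):
        # Conflict if either transaction wrote what the other read or wrote.
        w, ow = set(self.write_buf), set(other.write_buf)
        return bool(w & (other.read_set | ow) or ow & self.read_set)

    def commit(self):
        self.memory.update(self.write_buf)

mem = {"x": 0}
t1, t2 = Transaction(mem), Transaction(mem)
t1.write("x", t1.read("x") + 1)
t2.write("x", t2.read("x") + 1)  # both increment "x": a conflict
```

    On detecting the conflict, a TM system commits one transaction and aborts and retries the other, instead of serializing every critical section up front.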

    Accelerating phase unwrapping and affine transformations for optical quadrature microscopy using CUDA

    Optical Quadrature Microscopy (OQM) is a process that uses phase data to capture information about the sample being studied. OQM is part of an imaging framework developed by the Optical Science Laboratory at Northeastern University. In one application of interest, the framework is used to extract phase information from the image of an embryo to determine embryo viability. Phase unwrapping is the process of reconstructing the real phase shift (propagation delay) of a sample from the measured "wrapped" representation, which lies between −π and +π. Unwrapping can be done using the Minimum L^p Norm Phase Unwrap algorithm. Images are first preprocessed using an affine transform before they are unwrapped. Both of these steps are time consuming and would benefit greatly from parallelization and acceleration. Faster processing would lower many research barriers (in terms of throughput).
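    The unwrap step is easiest to see in one dimension (the Minimum L^p Norm method used here is 2-D and considerably more involved): whenever the wrapped phase jumps by more than π between neighboring samples, a multiple of 2π is added or subtracted to restore the true ramp.

```python
import math

def unwrap_phase(wrapped):
    """Undo 2*pi wrapping in a 1-D sequence of phase samples in (-pi, pi]."""
    unwrapped = [wrapped[0]]
    offset = 0.0
    for prev, cur in zip(wrapped, wrapped[1:]):
        jump = cur - prev
        if jump > math.pi:        # phase wrapped downward: compensate by -2*pi
            offset -= 2 * math.pi
        elif jump < -math.pi:     # phase wrapped upward: compensate by +2*pi
            offset += 2 * math.pi
        unwrapped.append(cur + offset)
    return unwrapped
```

    Each pixel's correction depends only on local differences, which is why the step parallelizes well on a GPU with CUDA.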